How complete is the CDC's COVID-19 case surveillance data for race/ethnicity for states and counties?

Status: Draft

Jan 28, 2021

Background

The disparities in the COVID-19 pandemic along racial and ethnic lines have exposed longstanding health inequities in the U.S., as made clear by multiple analyses of cases and deaths (CRDT, NYT, APM, KFF, NPR). However, these analyses were all based on incomplete data due to the fragmented data landscape for race/ethnicity breakdowns, which has largely been largely left to non-governmental organizations collecting data from individual state public health websites. For case data in particular, as opposed to deaths data, the CDC has only published public race/ethnicity data for cases at the U.S. level, not the state or county levels. Indeed, ASPE, an agency within HHS, wrote in Oct 2020 that "The volunteer-based COVID tracking project has created the most comprehensive centralized resource for race and ethnicity data at the state level."

In July 2020, the New York Times (NYT) published The Fullest Look Yet at the Racial Inequity of Coronavirus, a one-time analysis of data from the CDC obtained via FOIA and legal action that contained county-level case data with race/ethnicity up to May 28, 2020. While several non-governmental organizations have taken it upon themselves to gather data for total case counts at the county level (NYT, JHU, USAFacts), none of them have collected race/ethnicity data, which would be a huge undertaking due to the non-uniformity of race/ethnicity categories in state and local public health websites.

In Nov 2020, the CDC made some of the case data that the NYT obtained public: county-level totals in a dashboard and public data about race/ethnicity with additional dashboards, but without state and county details. They also released restricted access data with race/ethnicity, state, and county available upon request. The CDC's initial restricted access data agreement did not allow for county-level analyses to be made public, but an updated data agreement from Dec 14, 2020 allowed such public analyses. In Jan 2021, the Morehouse School of Medicine's Satcher Health Leadership Institute (MSM/SHLI), in collaboration with Citizens for Responsibility and Ethics in Washington (CREW) and Google.org, applied for and got access to this data.

The CDC Restricted Access data enabled us to complete the first public analysis of race/ethnicity disparities across the U.S. at the county level since the NYT analysis in July. However, the underlying data has significant completeness issues; e.g., only 80% of total cases are included and only 54% of cases have known race/ethnicity and county. In the table below, which compares the CDC/NYT data to the CDC data and totals from the Covid Tracking Project (CTP), we can see that some of the completeness measures did improve since the NYT obtained the data.

Screen Shot 2021-01-26 at 10.15.52 AM.png

Sources: NYT article and The Daily podcast episode about the article.

Overview

The goal of this analysis is to assess the completeness of the CDC's Restricted Access data and its feasibility in examining disparities in race/ethnicity for COVID-19 cases at the county level. We will first assess the completeness of the data on its own by looking at which fields are viable for analysis. We will next compare the total case counts in the restricted access data to two comparable public datasets at the state and county levels. We will also compare the cases with known race/ethnicity at the state level to the Covid Racial Data Tracker (CRDT) data.

The top-level data completeness findings are:

  1. Data fields: Most fields in the CDC's restricted access data are missing too many values to be useful. The only fields that we used were state, county, age, sex, and race/ethnicity. Race/ethnicity was only known for 55% of cases, as opposed to 97%-100% for the other fields.
  2. Total case count: The CDC data does not contain the total number of cases expected for some states and counties as compared to comparable data. While it is expected that the CDC will lag in some cases, the time lag alone can't explain some of the discrepancies.
  3. Race/ethnicity: Race/ethnicity data availability is highly variable across different states, and this is common to both the CDC and CRDT. However, the CDC data has both fewer cases and a lower percentage of cases with known race/ethnicity than the CRDT data in most states.

After examining the completeness of the data, we will finally examine race/ethnicity disparities at the county level for data up to Dec 16.

Note that we will not analyze any data about deaths or hospitalizations. While there are fields in the CDC data that indicate if the person died or was hospitalized, they are missing too many values to be reliable. There are also alternate data sources, such as the CDC Provisional Deaths data, which are based on different underlying data that may be more complete than the case data we are looking at here.

Completeness Analysis

Data Fields

The restricted access data contains 32 fields, which are described on the CDC website. The public version of the restricted access data contains 12 of those fields. The data comes from this case report form that is a dense, two-page form to get information about each lab-confirmed or probable COVID-19 case. The CDC has extensive FAQs about this surveillance data, one of which is about completeness:

How complete are the data that the CDC receives about COVID-19 cases?

The COVID-19 pandemic has put unprecedented demands on the public health data supply chain. In many states, the large number of COVID-19 cases has severely strained the ability of hospitals, healthcare providers, and laboratories to report cases with complete demographic information, such as race and ethnicity. The unprecedented volume of cases has also limited the ability of state and local health departments to conduct thorough case investigations and collect all requested case data.

As a result, many COVID-19 case notifications submitted to CDC do not have complete information on patient demographics; signs and symptoms of illness; underlying health conditions; characteristics of hospitalizations such as ventilator use; clinical outcomes; exposures; and factors that may put people at higher risk for severe disease. Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Because of missing data, analyses of these data elements are likely an underestimate of the true occurrence.

CDC data fields

In [ ]:
#@title
import pandas as pd
import altair as alt
from vega_datasets import data

%load_ext google.colab.data_table

from google.colab import auth
auth.authenticate_user()

alt.renderers.set_embed_options(actions=False)
In [70]:
#@title
def FieldAnalysis(project_id, table, field_list):
  dict = {}
  for field in field_list:
      dict[field] = [0.0, 0.0, 0.0, 0.0]
  unknowns = pd.DataFrame(dict, index=['Unknown', 'Missing', 'NA', 'Known'])
  field_series = []
  value_series = []
  percent_series = []

  for field in field_list:
    field_unknowns_query = ('''
    SELECT
      %s,
      count(*) as cases
    FROM
      %s
    GROUP BY
      %s
    ''')
    query = field_unknowns_query % (field, table, field)
    field_unknowns_df = pd.io.gbq.read_gbq(query, project_id=project_id)
    field_unknowns_df.set_index(field, inplace=True)
    field_unknowns_df.index = field_unknowns_df.index.fillna('Null')

    missing_count = 0
    if 'Missing' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc['Missing'].cases
    if 'Null' in field_unknowns_df.index:
      missing_count += field_unknowns_df.loc['Null'].cases
    #if field_unknowns_df.index.isnull().any():
    #  missing_count += field_unknowns_df.loc[field_unknowns_df.index.isnull()].cases
    unknowns.loc['Missing', field] = missing_count / field_unknowns_df.cases.sum()

    if 'Unknown' in field_unknowns_df.index:
      unknowns.loc['Unknown', field] = field_unknowns_df.loc['Unknown'].cases / field_unknowns_df.cases.sum()
    if 'NA' in field_unknowns_df.index:
      unknowns.loc['NA', field] = field_unknowns_df.loc['NA'].cases / field_unknowns_df.cases.sum()
    unknowns.loc['Known', field] = 1 - (unknowns.loc['Missing', field] +
                                        unknowns.loc['Unknown', field] +
                                        unknowns.loc['NA', field])
    field_series.extend([field, field, field, field])
    value_series.extend(['Known', 'Supressed', 'Unknown', 'Missing'])
    percent_series.extend([unknowns.loc['Known', field],
                           unknowns.loc['NA', field],
                           unknowns.loc['Unknown', field],
                           unknowns.loc['Missing', field]])
  test = pd.DataFrame.from_dict({'field': field_series,
                               'value': value_series,
                               'percent': percent_series})
  return alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='%'), title=''),
      y=alt.Y('field', sort='x', title='Field'),
      color=alt.Color('value', scale=alt.Scale(scheme='category20'), title='Value'),
      order=alt.Order('field:N'),
      tooltip=[
                  alt.Tooltip('field:N', title='Field'),
                  alt.Tooltip('value:N', title='Value'),
                  alt.Tooltip('percent:Q', format=',.0%', title='Percent'),
      ]
  )

Based on our analysis of the CDC data up to Dec 16, 2020, the only fields that are available for more than 50% of the cases are the date that the case was first reported to the CDC, the status of the case (lab-confirmed or probable), state, county, sex, age, and race/ethnicity, which are shown in the chart below. All other fields, including whether the person died or was hospitalized, are known for fewer than 50% of the cases.

In [71]:
#@title
field_list = ['cdc_case_earliest_dt', 'current_status', 'res_state', 'res_county', 'sex', 'age_group', 'race_ethnicity_combined']
project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
FieldAnalysis(project_id, table, field_list).display()

Race/ethnicity is known for only 55% of cases, while the other fields above are known for 97%-99% of cases. The 45% of cases without known race/ethnicity were either marked as "Unknown" on the case report form (35%), missing due to being left blank on the form (4%), or suppressed for privacy reasons for small geographic and/or demographic population groups (2%).

The CDC discussed the incompleteness of race/ethnicity data in their case data FAQs:

Most states have demographic factors like age and sex for most reported cases. However, in many states, the large number of COVID-19 cases has severely strained the ability to report cases with complete demographic information for race and ethnicity. With thousands of cases being reported, completeness of these elements is unlikely to improve in the immediate future for some jurisdictions.

The remaining fields, including whether the person died or was hospitalized, are all known for fewer than 50% of cases.

In [72]:
#@title
field_list = ['death_yn', 'hosp_yn', 'icu_yn', 'onset_dt', 'pos_spec_dt', 'hc_work_yn',
              'pna_yn', 'abxchest_yn', 'acuterespdistress_yn', 'mechvent_yn', 'fever_yn', 'sfever_yn', 'chills_yn', 'myalgia_yn', 'runnose_yn',
              'sthroat_yn', 'cough_yn', 'sob_yn', 'nauseavomit_yn', 'headache_yn', 'abdom_yn', 'diarrhea_yn', 'medcond_yn']
project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
FieldAnalysis(project_id, table, field_list).display()

The CDC also commented on these fields in their case data FAQs:

Because of the volume of cases, most health departments are unable to conduct investigations of every case to obtain additional information. Because of this, most case reports are missing data on patient demographics, symptoms, underlying health conditions, characteristics of hospitalizations such as ventilator use, and other factors such as recent travel history.

CREW/CDC data fields

The case report form contains many more fields, but unfortunately, the data gets less complete as you go down the form. CREW obtained a version of this data via FOIA that contains 101 fields with data up to Aug 25, 2020. Several of the additional fields from that dataset are shown below; the field with the most known data is whether the case was associated with an outbreak, but even that is only known for 30% of cases.

In [73]:
#@title
field_list = ['death_week', 'icu_length', 'hosp_length', 'translator_yn', 'housing', 'exp_work_critical', 'outbreak_associated',
              'rigors_yn', 'taste_yn', 'fatigue_yn', 'wheezing_yn', 'diffbreathing_yn', 'chestpain_yn', 'test_pcr', 'test_serologic',
              'exp_adultfacility', 'exp_airport', 'exp_animal', 'exp_community', 'exp_gathering', 'exp_contact', 'exp_correctional',
              'exp_ship', 'exp_house', 'exp_other', 'exp_school', 'exp_othcountry', 'exp_unk', 'exp_work']
project_id = 'msm-internal-data'
table = '`msm-internal-data.crew.covid_case_surveillance`'
FieldAnalysis(project_id, table, field_list).display()

Total Case Count

The first step to evaluating the completeness of the CDC data is to check the total case counts at the U.S., state, and county levels against known accurate data sources that aggregate state and local public health websites. The CDC case data FAQs say that we should not expect case data to always match the more accurate aggregate data, but that's a tradeoff we must make to get more detailed demographic information:

Aggregate counts provide the most up-to-date validated numbers on cases and deaths.

CDC receives the line-level data primarily from state health departments without personal identifiers such as names or home addresses. Because it can be time-consuming for jurisdictions to collect the additional information, these data can lag behind the aggregate counts. Although CDC receives this information for most cases, it does not receive it for all cases.

Many public health websites do contain race/ethnicity details, but they do not all use the same standard race/ethnicity categories (CRDT analysis). So, we must sacrifice accuracy and timeliness to get standardized race/ethnicity data on cases across states and counties.

We will compare the CDC data against two sources of aggregate data: The Covid Racial Data Tracker (CRDT) and the NYT's public data, which are both updated on a regular basis (CRDT twice a week, NYT daily) and come from state and local public health websites. CRDT is the only source for case data with race/ethnicity breakdowns, but there are several sources for county-level aggregate case data in addition to the NYT, such as JHU and USAFacts (this paper analyzes the differences between those sources at the state level up to July for cases and deaths).

The table below compares geographic vs. race/ethnicity availability for these three different data sources:

  • CDC: CDC Case Surveillance Restricted Access Data
  • CRDT: Covid Racial Data Tracker Public Data
  • NYT: New York Times COVID-19 Public Data

Screen Shot 2021-01-26 at 1.26.55 AM.png

Because the CDC is the only data source that has race/ethnicity at the county level, the most similar data for purposes of comparison are (1) CRDT data at the state level with race/ethnicity, and (2) NYT data at the county level with no race/ethnicity.

We will compare across these data sources up to Dec 16, 2020, which is the latest reporting date in the CDC data. We expect to see some variation in the case counts due to lags in reporting the data, but we don't expect that time lags can explain large percentages of missing cases.

Baseline: NYT vs. CRDT

To get a baseline of how much we could expect the CDC case counts to match the CRDT or NYT, we can see how closely the CRDT and NYT match each other. Each dot below is a state (hover to see details), and the black line shows where the NYT and CRDT case counts are equal.

In [79]:
#@title
CASES = 'Cases'
HOSPITALIZATIONS = 'Hospitalizations'
DEATHS = 'Deaths'
HCW_CASES = 'Healthcare Worker Cases'

DATASET = 'cdc'
#DATASET = 'crew'

metric = CASES
#metric = HOSPITALIZATIONS
#metric = DEATHS
#metric = HCW_CASES

project_id = 'msm-secure-data-1b'
table = '`msm-secure-data-1b.ndunlap_secure.cdc_restricted_access_20201231`'
date = 'Dec 16'

if DATASET == 'crew':
  project_id = 'msm-internal-data'
  table = '`msm-internal-data.crew.covid_case_surveillance`'
  date = 'Aug 11 (CREW)'

race_ethnicity_groups = ['black', 'hispanic', 'aian', 'nhpi', 'asian', 'white', 'other']
#race_ethnicity_groups = ['black', 'white'] # for hc_work_yn coverage
In [80]:
#@title
states_to_fips = {'AL': 1, 'AK': 2, 'AZ': 4, 'AR': 5, 'AS': 3, 'CA': 6, 'CO': 8, 'CT': 9, 'DC': 11, 'DE': 10, 'FL': 12, 'GA': 13, 'GU': 14, 'HI': 15, 'ID': 16, 'IL': 17, 'IN': 18, 'IA': 19, 'KS': 20, 'KY': 21, 'LA': 22, 'ME': 23, 'MD': 24, 'MA': 25, 'MI': 26, 'MN': 27, 'MS': 28, 'MO': 29, 'MT': 30, 'NE': 31, 'NV': 32, 'NH': 33, 'NJ': 34, 'NM': 35, 'NY': 36, 'NYC': 36, 'NC': 37, 'ND': 38, 'OH': 39, 'OK': 40, 'OR': 41, 'PA': 42, 'PR': 43, 'RI': 44, 'SC': 45, 'SD': 46, 'TN': 47, 'TX': 48, 'UT': 49, 'VT': 50, 'VA': 51, 'VI': 52, 'WA': 53, 'WV': 54, 'WI': 55, 'WY': 56, 'AS': 60, 'GU': 66, 'MP': 69, 'PR': 72, 'VI': 78}

crdt_query = ('''
SELECT
  State as state,
  Cases_Total as crdt_cases,
  #Deaths_Total as crdt_deaths,
  Cases_Total - Cases_Unknown as crdt_known_cases,
  #Deaths_Total - Deaths_Unknown as crdt_known_deaths,
  ROUND(1 - Cases_Unknown / Cases_Total, 4) as crdt_known_cases_percent,
  #ROUND(1 - Deaths_Unknown / Deaths_Total, 4) as crdt_known_deaths_percent,  
FROM `msm-secure-data-1b.ndunlap_secure.crdt`
WHERE
  date = 20201216
''')
crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)
crdt_df.set_index('state', inplace=True)
In [81]:
#@title
nyt_states_query = ('''
SELECT
  state_name,
  state_fips_code,
  confirmed_cases as nyt_cases,
  deaths as nyt_deaths
FROM `bigquery-public-data.covid19_nyt.us_states`
WHERE
  date = DATE(2020, 12, 16) AND
  state_fips_code IS NOT NULL
''')
nyt_states_df = pd.io.gbq.read_gbq(nyt_states_query, project_id=project_id)
nyt_states_df.state_fips_code.unique()
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'Puerto Rico']
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'Guam']
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'Virgin Islands']
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'Northern Mariana Islands']
nyt_states_df = nyt_states_df[nyt_states_df.state_name != 'American Samoa']
nyt_states_df['state_fips_code'] = nyt_states_df.state_fips_code.astype(int)
nyt_states_df.set_index('state_fips_code', inplace=True)
In [82]:
#@title
crdt_df.reset_index(inplace=True)
crdt_df['state_fips_code'] = crdt_df.state
crdt_df = crdt_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_df.set_index('state_fips_code', inplace=True)
nyt_crdt_merged_df = nyt_states_df.join(crdt_df, on="state_fips_code", how='inner', lsuffix='_left', rsuffix='_right')
In [ ]:
#@title
nyt_crdt_merged_df['percent'] = round(nyt_crdt_merged_df.nyt_cases / nyt_crdt_merged_df.crdt_cases, 2)
nyt_crdt_merged_df
nyt_crdt_merged_df.reset_index(inplace=True)

below_15 = len(nyt_crdt_merged_df[nyt_crdt_merged_df.percent < .85]) / len(nyt_crdt_merged_df)
above_15 = len(nyt_crdt_merged_df[nyt_crdt_merged_df.percent > 1.15]) / len(nyt_crdt_merged_df)
print('between +/-15%: ', round(1 - below_15 - above_15, 2))
nyt_crdt_merged_df.percent.describe()
In [84]:
#@title
tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of NYT to CRDT'),
]

plot = alt.Chart(nyt_crdt_merged_df).mark_circle(size=60).encode(
    alt.X('crdt_cases:Q', axis=alt.Axis(title='CRDT cases'),
        scale=alt.Scale(domain=(0, 2000000))
    ),
    alt.Y('nyt_cases:Q', axis=alt.Axis(title='NYT cases'),
        scale=alt.Scale(domain=(0, 2000000))
    ),
    color=alt.Color('percent',
                    scale=alt.Scale(scheme='blueorange',
                                    reverse=True,
                                               domain=[0, 2],
                                               clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'),
    tooltip=tooltips,
).properties(
    width=350,
    height=350,
)

line = pd.DataFrame({
    'x': [0, 2000000],
    'y': [0, 2000000],
})

line_plot = alt.Chart(line).mark_line(color='black').encode(
    x='x',
    y='y',
).properties(
    width=350,
    height=350,
)

test = (plot + line_plot).configure_mark(
    stroke='grey'
).properties(
    title='Ratio of NYT to CRDT Cases by State as of Dec 16',
    width=350,
    height=350,
).display()

The ratio of NYT to CRDT cases is between 0.97 and 1.11 for all states:

  • Average = 1.01
  • Median = 1.00
  • Min = 0.97 (Tennessee)
  • Max = 1.11 (Georgia)
  • Percent between 0.85 and 1.15 = 100%

We can also view these ratios on a map (hover over states for details).

In [85]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of NYT to CRDT'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(nyt_crdt_merged_df, 'state_fips_code', ['percent', 'state', 'nyt_cases', 'crdt_cases'])
  ).encode(
      alt.Color('percent',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='#F1F1F1',
      stroke='white'
).project('albersUsa')

layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Ratio of NYT Cases to CRDT Cases as of Dec 16'
).configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

States: CDC vs. CRDT

We can see below that the CDC case counts differ from the CRDT case counts much more drastically than the NYT did.

In [86]:
#@title
states_to_fips = {'AL': 1, 'AK': 2, 'AZ': 4, 'AR': 5, 'AS': 3, 'CA': 6, 'CO': 8, 'CT': 9, 'DC': 11, 'DE': 10, 'FL': 12, 'GA': 13, 'GU': 14, 'HI': 15, 'ID': 16, 'IL': 17, 'IN': 18, 'IA': 19, 'KS': 20, 'KY': 21, 'LA': 22, 'ME': 23, 'MD': 24, 'MA': 25, 'MI': 26, 'MN': 27, 'MS': 28, 'MO': 29, 'MT': 30, 'NE': 31, 'NV': 32, 'NH': 33, 'NJ': 34, 'NM': 35, 'NY': 36, 'NYC': 36, 'NC': 37, 'ND': 38, 'OH': 39, 'OK': 40, 'OR': 41, 'PA': 42, 'PR': 43, 'RI': 44, 'SC': 45, 'SD': 46, 'TN': 47, 'TX': 48, 'UT': 49, 'VT': 50, 'VA': 51, 'VI': 52, 'WA': 53, 'WV': 54, 'WI': 55, 'WY': 56, 'AS': 60, 'GU': 66, 'MP': 69, 'PR': 72, 'VI': 78}
compare_cases_query = ('''
SELECT
  res_state,
  COUNT(*) as cdc_cases
FROM
  %s
GROUP BY
   res_state
''')
# Unused: for CDC vs. NYT states case totals comparison.
#states_df = pd.io.gbq.read_gbq(compare_cases_query % table, project_id=project_id)
#states_df = states_df.replace(to_replace={'res_state': states_to_fips})
#states_df = states_df[states_df.res_state != 'Unknown']
#states_df = states_df[states_df.res_state != 'NA']
#states_df = states_df[states_df.res_state != 'OCONUS']
#states_df.rename(columns={'res_state': 'state_fips_code'}, inplace=True)
#states_df['state_fips_code'] = states_df.state_fips_code.astype(int)
#states_df.set_index('state_fips_code', inplace=True)
In [ ]:
#@title
states_df = pd.io.gbq.read_gbq(compare_cases_query % table, project_id=project_id)
states_df.rename(columns={'res_state': 'state'}, inplace=True)
states_df.set_index('state', inplace=True)

crdt_query = ('''
SELECT
  State as state,
  Cases_Total as crdt_cases,
  #Deaths_Total as crdt_deaths,
  Cases_Total - Cases_Unknown as crdt_known_cases,
  #Deaths_Total - Deaths_Unknown as crdt_known_deaths,
  ROUND(1 - Cases_Unknown / Cases_Total, 4) as crdt_known_cases_percent,
  #ROUND(1 - Deaths_Unknown / Deaths_Total, 4) as crdt_known_deaths_percent,  
FROM `msm-secure-data-1b.ndunlap_secure.crdt`
WHERE
  date = 20201216
''')
crdt_df = pd.io.gbq.read_gbq(crdt_query, project_id=project_id)
crdt_df = crdt_df[crdt_df.state != 'PR']
crdt_df = crdt_df[crdt_df.state != 'GU']
crdt_df = crdt_df[crdt_df.state != 'VI']
crdt_df = crdt_df[crdt_df.state != 'MP']
crdt_df = crdt_df[crdt_df.state != 'AS']
crdt_df.set_index('state', inplace=True)
crdt_merged_df = states_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
crdt_merged_df.reset_index(inplace=True)
crdt_merged_df['state_fips_code'] = crdt_merged_df.state
crdt_merged_df = crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_cases, 4)
#crdt_merged_df['percent'] = round((crdt_merged_df.cdc_cases - crdt_merged_df.crdt_cases) / crdt_merged_df.crdt_cases, 4)
#crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_deaths, 4)
#crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_known_cases, 4)
#crdt_merged_df['percent'] = round(crdt_merged_df.cdc_cases / crdt_merged_df.crdt_known_deaths, 4)
In [ ]:
#@title
below_15 = len(crdt_merged_df[crdt_merged_df.percent < .85]) / len(crdt_merged_df)
above_15 = len(crdt_merged_df[crdt_merged_df.percent > 1.15]) / len(crdt_merged_df)
print('between +/-15%: ', round(1 - below_15 - above_15, 2))
below_50 = len(crdt_merged_df[crdt_merged_df.percent < .5]) / len(crdt_merged_df)
above_50 = len(crdt_merged_df[crdt_merged_df.percent > 1.55]) / len(crdt_merged_df)
print('between +/-50%: ', round(1 - below_50 - above_50, 2))
crdt_merged_df.percent.describe()
In [88]:
#@title
tooltips = [alt.Tooltip('state:N', title='State'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('cdc_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(crdt_merged_df).mark_circle(size=60).encode(
    alt.X('crdt_cases:Q',
    #alt.X('crdt_known_cases:Q',
        scale=alt.Scale(domain=(0, 2000000)),
        axis=alt.Axis(title='CRDT cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('cdc_cases:Q',
    #alt.Y('cdc_known_cases:Q',
        scale=alt.Scale(domain=(0, 2000000)),
        axis=alt.Axis(title='CDC cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent', scale=alt.Scale(scheme='blueorange',
                                               reverse=True,
                                               domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'
                    ),
 tooltip=tooltips,
).properties(
    width=350,
    height=350
)

line = pd.DataFrame({
    'x': [0, 2000000],
    'y': [0, 2000000],
    #'x': [0, 1200000],
    #'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
    width=300,
    height=300
)

scatter = (plot + line_plot).properties(
    title='Ratio of CDC to CRDT Cases by State as of Dec 16'
    ).configure_mark(stroke='grey')
scatter.display()

The ratio of CDC to CRDT cases is between 0.03 and 1.64 for the 50 states plus D.C.:

  • Average = 0.79
  • Median = 0.97
  • Min = 0.03 (Wyoming)
  • Max = 1.64 (Alaska)
  • Percent between 0.85 and 1.15 = 63% (32 states)
  • Percent between 0.50 and 1.50 = 71% (36 states)

In other words, 63% of states in the CDC data are within +/-15% of the CRDT case counts and 71% of states are within +/-50% of those counts.

Here are the ratios shown on a map:

In [90]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')
tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to CRDT')
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'cdc_cases', 'crdt_cases'])
  ).encode(
      alt.Color('percent',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='#F1F1F1',
      stroke='white'
).project('albersUsa')

crdt_layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Ratio of CDC Cases to CRDT Cases as of Dec 16'
)

crdt_layered_map.configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

The 32 states that were within +/-15% of the CRDT data could plausibly be off due to time lags in reporting cases to the CDC vs. reporting them on state public health websites, but there are many outlier states that are too far off from the CRDT case counts to be explained by a time lag:

  • 14 states: < 0.5 ratio of CDC to CRDT cases
  • 1 state > 1.5 ratio of CDC to CRDT cases (Alaska)
In [ ]:
#@title
crdt_merged_df.percent.sort_values()

Counties: CDC vs. NYT

In [93]:
#@title
# CDC vs. NYT county

cases_query = ('''
SELECT
  res_state,
  res_county,
  race_ethnicity_combined,
  COUNT(*) as cases
FROM
  %s
GROUP BY
   res_county,
   res_state,
   race_ethnicity_combined
''')
df = pd.io.gbq.read_gbq(cases_query % table, project_id=project_id)
df = df[df.res_state != 'PR']
df = df[df.res_state != 'GU']
df = df[df.res_state != 'VI']
df = df[df.res_state != 'MP']
df = df[df.res_state != 'AS']
In [94]:
#@title
# CDC vs. NYT county

project_id = 'msm-internal-data'

df_county_fips_map = pd.io.gbq.read_gbq(f'''
SELECT
*
FROM
#  `msm-internal-data.ipums_acs.acs_2019_5year_county`
  `msm-internal-data.crew.county_fips_mapping`
''', project_id=project_id)

df_county_fips_map.crew_county = df_county_fips_map.crew_county.str.lower()
df_county_fips_map['state_county'] = df_county_fips_map.state + '-' + df_county_fips_map.crew_county
df_county_fips_map['state_county'] = df_county_fips_map.state_county.astype('string').str.strip()
df_county_fips_map.set_index('state_county', inplace=True)
In [95]:
#@title
# Concatenate the state and county names because county names are not unique across states.
df.res_county = df.res_county.str.lower()
df['state_county'] = df.res_state + '-' + df.res_county
df['state_county'] = df.state_county.astype('string').str.strip()
df.set_index('state_county', inplace=True)
df['race_ethnicity_combined'] = df.race_ethnicity_combined.astype('string').str.strip()
df = df.replace(to_replace={'race_ethnicity_combined': {
    'Asian, Non-Hispanic': 'asian_cases',
    'Black, Non-Hispanic': 'black_cases',
    'White, Non-Hispanic': 'white_cases',
    'American Indian/Alaska Native, Non-Hispanic': 'aian_cases',
    'Hispanic/Latino': 'hispanic_cases',
    'Multiple/Other, Non-Hispanic': 'other_cases',
    'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'nhpi_cases',
    'Missing': 'unknown_cases',
    'Unknown': 'unknown_cases',
    'NA': 'na_cases',
    #'Yes': 'black_cases',  # for hc_work_yn
    #'No': 'white_cases',  # for hc_work_yn
  }}
)
In [96]:
#@title
merged_df = df.join(df_county_fips_map, on="state_county", how='inner', lsuffix='_left', rsuffix='_right')
#na_df = merged_df[merged_df['res_county'] == 'na']
#print(sum(na_df.cases))
#no_na_df = merged_df[merged_df['res_county'] != 'na']
#no_na_df = no_na_df[no_na_df['res_county'] != 'other']
#no_na_df = no_na_df[no_na_df['res_county'] != 'unknown']
#mismatch_df = no_na_df[no_na_df['county_fips'].isnull()]
#unique = pd.DataFrame(mismatch_df.index.unique())
#sum(mismatch_df.cases)
In [97]:
#@title
# Create a crosstab table with rows = counties, columns = race_ethnicity_combined.
crosstab_df = pd.crosstab(merged_df['county_fips'], merged_df.race_ethnicity_combined, values=merged_df.cases, aggfunc=sum,
                          margins=True,
                          margins_name='total_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['county_fips'] = crosstab_df.county_fips.astype(int)
crosstab_df['total_known_cases'] = crosstab_df['total_cases'] - crosstab_df.unknown_cases.fillna(0)
crosstab_df['total_known_cases'] = crosstab_df['total_cases'] - crosstab_df.na_cases.fillna(0) - crosstab_df.unknown_cases.fillna(0)
In [98]:
#@title
df_acs_name_lookup = pd.io.gbq.read_gbq(f'''
SELECT
  *
FROM
  `msm-internal-data.ipums_acs.acs_2019_5year_county`
''', project_id=project_id)

df_acs_name_lookup['state_county'] = df_acs_name_lookup.county.astype('string').str.strip() + ', ' + df_acs_name_lookup.state.astype('string').str.strip()
df_acs_name_lookup.drop(columns=['state', 'county'], inplace=True)
df_acs_name_lookup.set_index('county_fips', inplace=True)

county_chart_df = crosstab_df.join(df_acs_name_lookup, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
county_chart_df.county_fips = county_chart_df.county_fips.astype(int)
In [ ]:
#@title
print(len(county_chart_df))
print(county_chart_df.total_pop.sum())
print(county_chart_df.total_pop.sum() / 324697795)  # Population covered in these counties
print(0.55 * 324697795) # NYT population
In [100]:
#@title
nyt_counties_query = ('''
SELECT
  county_fips_code,
  confirmed_cases as nyt_cases,
FROM `bigquery-public-data.covid19_nyt.us_counties`
WHERE
  date = DATE(2020, 12, 16) AND
  county_fips_code IS NOT NULL
''')
nyt_counties_df = pd.io.gbq.read_gbq(nyt_counties_query, project_id=project_id)
nyt_counties_df.rename(columns={'county_fips_code': 'county_fips'}, inplace=True)
nyt_counties_df.county_fips.unique()
nyt_counties_df['county_fips'] = nyt_counties_df.county_fips.astype(int)
nyt_counties_df.set_index('county_fips', inplace=True)
In [101]:
#@title
county_chart_df.set_index('county_fips', inplace=True)
nyt_merged_df = county_chart_df.join(nyt_counties_df, on="county_fips", how='inner', lsuffix='_left', rsuffix='_right')
nyt_merged_df = nyt_merged_df.reset_index()
#nyt_merged_df.county_fips = nyt_merged_df.county_fips.astype(int)
nyt_merged_df['percent'] = round(nyt_merged_df.total_cases / nyt_merged_df.nyt_cases, 2)
#nyt_merged_df.reset_index(inplace=True)
In [102]:
#@title
tooltips = [alt.Tooltip('state_county:N', title='County'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('total_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to NYUT'),
]

plot = alt.Chart(nyt_merged_df).mark_circle(size=60).encode(
    alt.X('nyt_cases:Q',
    #alt.X('crdt_known_cases:Q',
        scale=alt.Scale(domain=(0, 100000), clamp=True),
        axis=alt.Axis(title='NYT cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('total_cases:Q',
    #alt.Y('cdc_known_cases:Q',
        scale=alt.Scale(domain=(0, 100000), clamp=True),
        axis=alt.Axis(title='CDC cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent', scale=alt.Scale(scheme='blueorange',
                                               reverse=True,
                                               domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'),
 tooltip=tooltips,
).properties(
     width=350,
    height=350,
)

line = pd.DataFrame({
    'x': [0, 100000],
    'y': [0, 100000],
    #'x': [0, 1200000],
    #'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
     width=350,
    height=350,
)

scatter_clamp = (plot + line_plot).properties(title='Zoom in on Counties up to 100,000 population')

We can do the same analysis at the county level using the CDC vs. NYT data.

Each dot is a county (hover to see details). We show all counties on the left and zoom in on the smaller counties on the right.

In [92]:
#@title
tooltips = [alt.Tooltip('state_county:N', title='County'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('total_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='CDC to NYT ratio'),
]

plot = alt.Chart(nyt_merged_df).mark_circle(size=60).encode(
    alt.X('nyt_cases:Q',
    #alt.X('crdt_known_cases:Q',
        scale=alt.Scale(domain=(0, 600000)),
        axis=alt.Axis(title='NYT cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('total_cases:Q',
    #alt.Y('cdc_known_cases:Q',
        scale=alt.Scale(domain=(0, 600000)),
        axis=alt.Axis(title='CDC cases')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent', scale=alt.Scale(scheme='blueorange',
                                               reverse=True,
                                               domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'),
 tooltip=tooltips,
)

line = pd.DataFrame({
    'x': [0, 600000],
    'y': [0, 600000],
    #'x': [0, 1200000],
    #'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
     width=350,
    height=350,
)

scatter_all = (plot + line_plot).properties(
    title='Ratio of CDC to NYT Cases by County as of Dec 16'
).properties(
     width=350,
    height=350,
)

(scatter_all | scatter_clamp
 ).configure_mark(stroke='grey'
 ).display()
In [ ]:
#@title
below_15 = len(nyt_merged_df[nyt_merged_df.percent < .85])
above_15 = len(nyt_merged_df[nyt_merged_df.percent > 1.15])
print(len(nyt_merged_df) - below_15 - above_15)
print('between +/-15%: ', round(
    1 - (below_15 / len(nyt_merged_df)) - (above_15 / len(nyt_merged_df)), 2))
below_50 = len(nyt_merged_df[nyt_merged_df.percent < .5])
above_50 = len(nyt_merged_df[nyt_merged_df.percent > 1.5])
print(len(nyt_merged_df) - below_50 - above_50)
print('between +/-50%: ', round(
    1 - (below_50 / len(nyt_merged_df)) - (above_50 / len(nyt_merged_df)), 2))
nyt_merged_df.percent.describe()

The ratio of CDC to NYT cases is between 0.00 and 9.80 for the 3,045 counties in the CDC data:

  • Average = 0.79
  • Median = 0.97
  • Min = 0.00
  • Max = 9.80 (Lake and Peninsula Borough, Alaska)
  • Percent between 0.85 and 1.15 = 48% (1,463 counties)
  • Percent between 0.50 and 1.50 = 70% (2,139 counties)

In other words, 48% of counties in the CDC data are within +/-15% of the NYT case counts and 70% are within +/-50% of those counts.

We can also view these ratios on a map. Note that legend only goes to 2.0, and all counties with a larger ratio are shown in the same dark blue.

In [133]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'county_fips'], empty='none')
tooltips = [alt.Tooltip('state_county:N', title='County'),
            alt.Tooltip('total_cases:Q', format=',', title='CDC cases'),
            alt.Tooltip('nyt_cases:Q', format=',', title='NYT cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to NYT')
]

#plot = alt.Chart(us_states).mark_geoshape(
plot = alt.Chart(us_counties).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(nyt_merged_df, 'county_fips', ['percent', 'state_county', 'total_cases', 'nyt_cases'])
      #from_=alt.LookupData(nyt_merged_df, 'county_fips', ['percent'])
  ).encode(
      alt.Color('percent',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='silver',
      stroke='white'
).project('albersUsa')

layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=900,
      height=650,
      title='Ratio of CDC Cases to NYT Cases as of Dec 16'
)

layered_map.configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

Race/ethnicity

In [138]:
#@title
compare_cases_unknowns_query = ('''
SELECT
  res_state,
  race_ethnicity_combined,
  COUNT(*) as cdc_cases
FROM
  %s
GROUP BY
   res_state,
   race_ethnicity_combined
''')
states_df = pd.io.gbq.read_gbq(compare_cases_unknowns_query % table, project_id=project_id)
#states_df = states_df.replace(to_replace={'res_state': states_to_fips})
states_df = states_df[states_df.res_state != 'Unknown']
states_df = states_df[states_df.res_state != 'NA']
states_df = states_df[states_df.res_state != 'OCONUS']
#states_df = states_df.reset_index()

states_df['race_ethnicity_combined'] = states_df.race_ethnicity_combined.astype('string').str.strip()
states_df = states_df.replace(to_replace={'race_ethnicity_combined': {
    'Asian, Non-Hispanic': 'cdc_known_cases',
    'Black, Non-Hispanic': 'cdc_known_cases',
    'White, Non-Hispanic': 'cdc_known_cases',
    'American Indian/Alaska Native, Non-Hispanic': 'cdc_known_cases',
    'Hispanic/Latino': 'cdc_known_cases',
    'Multiple/Other, Non-Hispanic': 'cdc_known_cases',
    'Native Hawaiian/Other Pacific Islander, Non-Hispanic': 'cdc_known_cases',
    'Missing': 'cdc_unknown_cases',
    'Unknown': 'cdc_unknown_cases',
    'NA': 'cdc_na_cases',
    }})
states_df.rename(columns={'res_state': 'state'}, inplace=True)
#states_df['state_fips_code'] = states_df.state_fips_code.astype(int)
#states_df.set_index('state_fips_code', inplace=True)
In [139]:
#@title
crosstab_df = pd.crosstab(states_df['state'], states_df.race_ethnicity_combined, values=states_df.cdc_cases, aggfunc=sum,
                          margins=True,
                          margins_name='cdc_cases'
)
# Have to reset_index() to go from pandas multi-index to single index.
crosstab_df = crosstab_df.reset_index()
crosstab_df.drop(axis=0, index=len(crosstab_df) - 1, inplace=True)
crosstab_df['cdc_known_or_na_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df['cdc_known_cases'] = crosstab_df['cdc_cases'] - crosstab_df.cdc_na_cases.fillna(0) - crosstab_df.cdc_unknown_cases.fillna(0)
crosstab_df

crdt_merged_df = crosstab_df.join(crdt_df, on="state", how='inner', lsuffix='_left', rsuffix='_right')
crdt_merged_df.reset_index(inplace=True)
crdt_merged_df['state_fips_code'] = crdt_merged_df.state
crdt_merged_df = crdt_merged_df.replace(to_replace={'state_fips_code': states_to_fips})
crdt_merged_df['cdc_known_cases_percent'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['cdc_known_or_na_cases_percent'] = round(crdt_merged_df.cdc_known_or_na_cases / crdt_merged_df.cdc_cases, 4)
crdt_merged_df['percent'] = round(crdt_merged_df.cdc_known_cases_percent / crdt_merged_df.crdt_known_cases_percent, 4)
crdt_merged_df['percent_with_na'] = round(crdt_merged_df.cdc_known_or_na_cases_percent / crdt_merged_df.crdt_known_cases_percent, 4)
crdt_merged_df['percent_counts'] = round(crdt_merged_df.cdc_known_cases / crdt_merged_df.crdt_known_cases, 4)
crdt_merged_df['one'] = round(crdt_merged_df.crdt_cases / crdt_merged_df.crdt_cases, 4)
crdt_merged_df_no_ny = crdt_merged_df[crdt_merged_df.state != 'NY']

We can now look at the completeness of the CDC data for race/ethnicity on its own and in comparison to the CRDT data at the state level.

States: CDC vs. CRDT

We can first compare the number of cases with known race/ethnicity within each state between the CDC and CRDT data.

In [140]:
#@title
tooltips = [alt.Tooltip('state:N', title='State'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('cdc_known_cases:Q', format=',', title='CDC known cases'),
              alt.Tooltip('crdt_known_cases:Q', format=',', title='CRDT known cases'),
              alt.Tooltip('percent_counts:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(crdt_merged_df).mark_circle(size=60).encode(
    alt.X('crdt_known_cases:Q',
    #alt.X('crdt_known_cases:Q',
        #scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title='CRDT known race/ethnicity cases'),
        scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('cdc_known_cases:Q',
    #alt.Y('cdc_known_cases:Q',
        #scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title='CDC known race/ethnicity cases'),
        scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent_counts', scale=alt.Scale(scheme='blueorange',
                                                      reverse=True,
                                                      domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'
                    ),
 tooltip=tooltips,
)

line = pd.DataFrame({
    #'x': [0, 1],
    #'y': [0, 1],
    'x': [0, 1200000],
    'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
    height=350,
    width=350
)

scatter = (plot + line_plot).properties(
    title='Ratio of CDC to CRDT Cases with Known Race/Ethnicity by State as of Dec 16'
    ).configure_mark(stroke='grey'
)
scatter.display()
In [ ]:
#@title
below_15 = len(crdt_merged_df_no_ny[crdt_merged_df_no_ny.percent_counts < .85])
above_15 = len(crdt_merged_df_no_ny[crdt_merged_df_no_ny.percent_counts > 1.15])
print (len(crdt_merged_df_no_ny) - above_15 - below_15)
print('between +/-15%: ', round(1 - below_15 / len(crdt_merged_df_no_ny) - above_15 / len(crdt_merged_df_no_ny), 2))
below_50 = len(crdt_merged_df_no_ny[crdt_merged_df_no_ny.percent_counts < .5])
above_50 = len(crdt_merged_df_no_ny[crdt_merged_df_no_ny.percent_counts > 1.55])
print (len(crdt_merged_df_no_ny) - above_50 - below_50)
print('between +/-50%: ', round(1 - below_50 / len(crdt_merged_df_no_ny) - above_50 / len(crdt_merged_df_no_ny), 2))
crdt_merged_df_no_ny.percent_counts.describe()

The ratio of CDC to CRDT cases with known race/ethnicity is between 0.01 and 1.18 for all states excluding New York, which has 0 known cases in CRDT.

  • Average = 0.63
  • Median = 0.77
  • Min = 0.01 (North Dakota, Louisiana, Wyoming)
  • Max = 1.18 (Massachusetts)
  • Percent between 0.85 and 1.15 = 34% (17 states)
  • Percent between 0.50 and 1.50 = 64% (32 states)

Only 4 states (Massachusetts, Minnesota, Utah, Washington) had more cases with known race/ethnicity in the CDC data than in the CRDT data, whereas 23 states had more total cases in the CDC data than in the CRDT data. We can again view this comparison on a map:

In [156]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_known_cases:Q', format=',', title='CDC known cases'),
              alt.Tooltip('crdt_known_cases:Q', format=',', title='CRDT known cases'),
              alt.Tooltip('percent_counts:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent_counts', 'state', 'cdc_known_cases', 'crdt_known_cases'])
  ).encode(
      alt.Color('percent_counts',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='silver',
      stroke='white'
).project('albersUsa')

layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Ratio of CDC to CRDT Cases with Known Race/Ethnicity as of Dec 16'
).configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
)
layered_map.display()
In [ ]:
#@title
crdt_merged_df_no_ny.percent_counts.sort_values()

What accounts for the differences between the CDC and CRDT for the number of cases with known race/ethnicity?

  1. Total case counts, which we examined in the Total Cases comparison above
  2. The percent of total cases with known race/ethnicity

Comparing the number of cases with known race/ethnicity combines these two factors into one. We can separate the factors by comparing the total case counts by state, which we already did above, and separately comparing the percentage of cases with known race/ethnicity by state.

In [155]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('cdc_known_cases:Q', format=',', title='Known race/ethnicity cases'),
              alt.Tooltip('cdc_known_cases_percent:Q', format='.1%', title='Percent known cases'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'cdc_cases', 'cdc_known_cases', 'cdc_known_cases_percent'])
  ).encode(
      alt.Color('cdc_known_cases_percent',  
                type='quantitative', 
                legend=alt.Legend(format='.0%'),
                scale=alt.Scale(scheme='redyellowblue',
                                domain=[0, 1],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='#black',
      stroke='white'
).project('albersUsa')

cdc_known_layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=450,
      height=350,
      title='Percent of CDC Cases with Known Race/Ethnicity as of Dec 16'
)
In [157]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('crdt_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('crdt_known_cases:Q', format=',', title='Known race/ethnicity cases'),
              alt.Tooltip('crdt_known_cases_percent:Q', format='.1%', title='Percent known cases'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'crdt_cases', 'crdt_known_cases', 'crdt_known_cases_percent'])
  ).encode(
      alt.Color('crdt_known_cases_percent',  
                type='quantitative', 
                legend=alt.Legend(format='.0%'),
                scale=alt.Scale(scheme='redyellowblue',
                                domain=[0, 1],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='#black',
      stroke='white'
).project('albersUsa')

crdt_known_layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=450,
      height=350,
      title='Percent of CRDT Cases with Known Race/Ethnicity as of Dec 16'
)
(cdc_known_layered_map | crdt_known_layered_map).configure_legend(
      orient='top',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()
In [159]:
#@title
tooltips = [alt.Tooltip('state:N', title='State'),
              #alt.Tooltip('crdt_known_cases:Q', format=',', title='CDC cases'),
              #alt.Tooltip('cdc_known_cases:Q', format=',', title='CRDT cases'),
              alt.Tooltip('cdc_known_cases_percent:Q', format='.0%', title='CDC known cases'),
              alt.Tooltip('crdt_known_cases_percent:Q', format='.0%', title='CRDT known cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(crdt_merged_df).mark_circle(size=60).encode(
    alt.X('crdt_known_cases_percent:Q',
    #alt.X('crdt_known_cases:Q',
        scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title='CRDT cases', format='.0%')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    alt.Y('cdc_known_cases_percent:Q',
    #alt.Y('cdc_known_cases:Q',
        scale=alt.Scale(domain=(0, 1)),
        axis=alt.Axis(title='CDC cases', format='.0%')
        #scale=alt.Scale(domain=(0, 1200000))
    ),
    color=alt.Color('percent', scale=alt.Scale(scheme='blueorange',
                                               reverse=True,
                                               domain=[0, 2], clamp=True),
                    legend=alt.Legend(format='.2f'),
                    title='Ratio'
                    ),
 tooltip=tooltips,
)

line = pd.DataFrame({
    'x': [0, 1],
    'y': [0, 1],
    #'x': [0, 1200000],
    #'y': [0, 1200000],
})

line_plot = alt.Chart(line).mark_line(color= 'black').encode(
    x='x',
    y='y',
).properties(
    height=350,
    width=350
)

scatter = (plot + line_plot).properties(
    title='Ratio of CDC to CRDT Percent of Cases with Known Race/Ethnicity by State as of Dec 16'
    )

scatter.configure_mark(stroke='grey').display()
In [160]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_known_cases_percent:Q', format='.0%', title='CDC known cases'),
              alt.Tooltip('crdt_known_cases_percent:Q', format='.0%', title='CRDT known cases'),
              alt.Tooltip('percent:Q', format='.2f', title='Ratio of CDC to CRDT'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'cdc_known_cases_percent', 'crdt_known_cases_percent'])
  ).encode(
      alt.Color('percent',  
                type='quantitative', 
                legend=alt.Legend(format='.2f'),
                scale=alt.Scale(scheme='blueorange',
                                reverse=True,
                                domain=[0, 2],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='silver',
      stroke='white'
).project('albersUsa')

layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=500,
      height=400,
      title='Ratio of CDC to CRDT Percent of Cases with Known Race/ethnicity as of Dec 16'
).configure_legend(
      orient='top-right',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

When calculating disparities between different race/ethnicity groups, we will need to be cautious to draw conclusions from data in states where there is race/ethnicity data for a small percentage of the population and/or the overall case totals are incomplete. For example, California only has 21% of cases with race/ethnicity, and 0% of the cases are Hispanic/Latino (96 out of 1.9M), which doesn't match state or local reporting.

Counties: CDC

We don't have a point of comparison for the known race/ethnicity percentage at the county level, as we do at the state level, but we can look at the percentage of cases with known/race ethnicity in the CDC data on its own to see the variation across states and counties.

In [44]:
#@title
def GenerateColNames(group):
  cases_col = group + '_cases'
  pop_col = group + '_pop'
  pop_percent_col = group + '_percent'
  cases_percent_col = group + '_cases_percent'
  cases_percent_with_unknown_col = group + '_cases_percent_with_unknown'
  cases_per_100_col = group + '_cases_per_100'
  cases_to_pop_col= group + '_cases_to_pop'
  cases_to_pop_with_unknown_col= group + '_cases_to_pop_with_unknown'
  return {'cases': cases_col,
          'pop': pop_col,
          'pop_percent': pop_percent_col,
          'cases_per_100': cases_per_100_col,
          'cases_percent': cases_percent_col,
          'cases_percent_with_unknown': cases_percent_with_unknown_col,
          'cases_to_pop': cases_to_pop_col,
          'cases_to_pop_with_unknown': cases_to_pop_with_unknown_col,
  }

group_names = {}
for group in race_ethnicity_groups:
  group_names[group] = GenerateColNames(group)

chart_df = county_chart_df.copy(deep=True)
chart_df.reset_index(inplace=True)
chart_df.county_fips = chart_df.county_fips.astype(int)
for group in race_ethnicity_groups:
  chart_df[group_names[group]['cases_per_100']] = round(chart_df[group_names[group]['cases']] / chart_df[group_names[group]['pop']], 4)
  chart_df[group_names[group]['cases_percent']] = round(chart_df[group_names[group]['cases']] / chart_df.total_known_cases, 2)
  chart_df[group_names[group]['cases_percent_with_unknown']] = round(chart_df[group_names[group]['cases']] / chart_df.total_cases, 2)
  chart_df[group_names[group]['cases_to_pop']] = round(
      chart_df[group_names[group]['cases_percent']] / chart_df[group_names[group]['pop_percent']], 2)
  chart_df[group_names[group]['cases_to_pop_with_unknown']] = round(
      chart_df[group_names[group]['cases_percent_with_unknown']] / chart_df[group_names[group]['pop_percent']], 2)
chart_df['percent_known_cases'] = round(chart_df.total_known_cases / chart_df.total_cases, 2)
chart_df['percent_known_or_na_cases'] = round((chart_df.total_known_cases + chart_df.na_cases) / chart_df.total_cases, 2)
In [168]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

filter_data = False
#MIN_POP_PERCENT = 0.001
MIN_POP = 100
MIN_PERCENT_KNOWN = 0.5
MIN_CASES = 5

group_to_display_name = {
    'black': 'Black',
    'white': 'White',
    'hispanic': 'Hispanic/Latino',
    'asian': 'Asian',
    'nhpi': 'Native Hawaiian/Pacific Islander',
    'aian': 'American Indian/Alaska Native',
    'other': 'Other or multiple race/ethnicity',
    'total': 'Total'
}

group_to_short_name = {
    'black': 'Black',
    'white': 'White',
    'hispanic': 'Hispanic',
    'asian': 'Asian',
    'nhpi': 'NHPI',
    'aian': 'AIAN',
    'other': 'Other',
    'total': 'Total'
}

chart_col_to_color_scheme = {
    'cases_per_100': 'yelloworangebrown',
    'cases_to_pop': 'blueorange',
    'percent_known_cases': 'redyellowblue',
    'percent_known_or_na_cases': 'redyellowblue',
}
chart_col_to_legend_format = {
    'cases_per_100': '.0%',
    'cases_to_pop': '.1f',
    'percent_known_cases': '.0%',
    'percent_known_or_na_cases': '.0%',
}

def GenerateCountyMap(chart_df, chart_col, group, group_names, metric, date):
  group_chart_col = chart_col
  if group:
    group_chart_col = group_names[group][chart_col]
  group_display_name = ''
  if group:
    group_display_name = group_to_short_name[group]
    group_short_name = group_to_short_name[group]
  
  chart_col_to_range = {
    'cases_per_100': [0, .2],
    'cases_to_pop': [0, 2],
    'percent_known_cases': [0, 1],
    'percent_known_or_na_cases': [0, 1],
  }    

  prevalence_text = 'who had COVID-19'

  col_to_title = {
      'total_cases': group_display_name + ' ' + metric + ' as of ' + date,
      'cases_per_100': 'Percent of ' + group_display_name + ' Population ' + prevalence_text + ' as of ' + date,
      'cases_to_pop': 'Ratio of ' + group_display_name + ' ' + metric + ' Share to Population Share'  + ' as of ' + date,
      'percent_known_cases': 'Percent of ' + metric + ' with Known Race/Ethnicity' + ' as of ' + date,
      'percent_known_or_na_cases': 'Percent of ' + metric + ' with Known or Suppressed Race/Ethnicity' + ' as of ' + date,
  }

  filtered_chart_df = chart_df
  if group and filter_data:
    #filtered_chart_df = filtered_chart_df[filtered_chart_df[group_names[group]['pop_percent']] > MIN_POP_PERCENT]
    filtered_chart_df = filtered_chart_df[filtered_chart_df[group_names[group]['pop']] > MIN_POP]
    filtered_chart_df = filtered_chart_df[filtered_chart_df['percent_known_cases'] > MIN_PERCENT_KNOWN]
    filtered_chart_df = filtered_chart_df[filtered_chart_df[group_names[group]['cases']] > MIN_CASES]

  highlight = alt.selection_single(on='mouseover', fields=['id', 'county_fips'], empty='none')

  data_cols = ['state_county',
               'percent_known_cases',
               'percent_known_or_na_cases',
               'total_cases']
  if group:
    data_cols.extend([
                      group_names[group]['cases'],
                      group_names[group]['pop'],
                      group_names[group]['pop_percent'],
                      group_names[group]['cases_per_100'],
                      group_names[group]['cases_percent'],
                      group_names[group]['cases_percent_with_unknown'],
                      group_names[group]['cases_to_pop'],
                      group_names[group]['cases_to_pop_with_unknown'],
                      ])

  tooltips = [alt.Tooltip('state_county:N', title='County'),
              alt.Tooltip('percent_known_cases:Q', format='.0%', title=metric + ' with race/ethnicity')
  ]
  if chart_col in ('percent_known_cases', 'percent_known_or_na_cases'):
    tooltips.extend([
               alt.Tooltip('total_cases:Q', format=',.0f', title=metric)
   ])
  if chart_col == 'percent_known_or_na_cases':
    tooltips.extend([
               alt.Tooltip('percent_known_or_na_cases:Q', format='.0%',
                           title=metric + ' with known or suppressed race/ethnicity')
   ])
  if group:
    tooltips.extend([
                alt.Tooltip(group_names[group]['cases'] + ':Q', format=',',
                            title=group_short_name + ' ' + metric.lower()),
    ])
    if chart_col == 'cases_per_100':
      tooltips.extend([
                  alt.Tooltip(group_names[group]['pop'] + ':Q', format=',',
                            title=group_short_name + ' population'),
                  alt.Tooltip(group_names[group]['cases_per_100'] + ':Q', format='.2%',
                              title='Percent ' + prevalence_text)
      ])
    elif chart_col == 'cases_to_pop':
      tooltips.extend([
                  alt.Tooltip(group_names[group]['cases_percent_with_unknown'] + ':Q', format='.1%',
                              title='Percent of total ' + metric.lower()),
                  alt.Tooltip(group_names[group]['cases_percent'] + ':Q', format='.1%',
                              title='Percent of known race/ethnicity ' + metric.lower()),
                  alt.Tooltip(group_names[group]['pop_percent'] + ':Q', format='.1%',
                              title=group_short_name + ' percent of population'),
                  alt.Tooltip(group_names[group]['cases_to_pop'] + ':Q', format='.2f',
                              title='Ratio of percent of known ' + metric.lower() + ' to percent of population'),
                  #alt.Tooltip(group_names[group]['cases_to_pop_with_unknown'] + ':Q', format='.2f',
                  #            title='Ratio of ' + metric.lower() + ' to population including unknowns'),
      ])
  reverse_scale = False
  if chart_col == 'cases_to_pop':
    reverse_scale = True

  plot = alt.Chart(us_counties).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(filtered_chart_df, 'county_fips', data_cols)
  ).encode(
      alt.Color(group_chart_col,  
                type='quantitative', 
                legend=alt.Legend(format=chart_col_to_legend_format[chart_col]),
                scale=alt.Scale(scheme=chart_col_to_color_scheme[chart_col],
                                reverse=reverse_scale,
                                domain=chart_col_to_range[chart_col],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

  states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
  )

  states_fill = alt.Chart(us_states).mark_geoshape(
      fill='silver',
      stroke='white'
  ).project('albersUsa')

  layered_map = alt.layer(states_fill, plot, states_outline).properties(
      title=col_to_title[chart_col],
  )
  return layered_map
In [169]:
#@title
group_charts = {'cases_per_100': {}, 'cases_to_pop': {}}

known_percent = GenerateCountyMap(
    chart_df, 'percent_known_cases', None, group_names, metric, date)
percent_known_or_na_cases = GenerateCountyMap(
    chart_df, 'percent_known_or_na_cases', None, group_names, metric, date)

for group in race_ethnicity_groups:
  for value in ('cases_per_100', 'cases_to_pop'):
    group_charts[value][group] = GenerateCountyMap(
        chart_df, value, group, group_names, metric, date)
In [47]:
#@title
known_percent.configure_legend(
      orient='top-right',
      gradientLength=400,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
  ).properties(
    width=900,
    height=650,
  ).display()

Disparities Analysis

U.S.: Race/ethnicity and age

We can now look at disparities between different race/ethnicity groups for the entire U.S.

In the chart below, AIAN stands for "American Indian / Alaska Native" and "NHPI" stands for "Native Hawaiian / Pacific Islander."

In [161]:
#@title

test = pd.DataFrame.from_dict({'group': [
                                         'Black', 'Hispanic/Latino', 'White', 'Asian/NHPI', 'AIAN', '-Total-',
                                         ],
                               'percent': [.0234, .0254, .0211, .0129, .0376, .0409]})
alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='.1%'), title=''),
      y=alt.Y('group', sort='-x', title=''),
      color=alt.Color('group', 
                      scale=alt.Scale(scheme='category20'),
                      title=''),
      order=alt.Order('percent:N'),
      tooltip=[
                  alt.Tooltip('group:N', title='Field'),
                  alt.Tooltip('percent:Q', format='.2%', title='Cases in race/ethnicity group'),
      ]
).properties(
    title='Percent of Race/Ethnicity Group in the U.S. who had COVID-19 as of Dec 16'
).display()

The CDC case data shows 4.09% of the U.S. population having had COVID-19, whereas the CRDT data shows 5.12% of the U.S. population having had COVID-19 up to Dec 16 (based on the CDC data only having 80% of the total cases in the CRDT data). Note that the Total group is larger than all of the other groups because it also includes the 45% of cases in the data that didn't have known race/ethnicity.

We can also look at the percent of each age and race/ethnicity group who had COVID-19.

In [163]:
#@title
# Age x race numbers come from a spreadsheet with manual calculations using BQ and the ACS 1-year estimates.
# Asian and NHPI are combined because the IPUMS data used to calculate the age categories (not available in ACS API)
# split out Asian subgroups, including "Other Asian or Pacific Islander," and so I combined them into one category.
# https://usa.ipums.org/usa-action/variables/RACE#codes_section

race_list = ['Black'] * 9
race_list.extend(['Hispanic/Latino'] * 9)
race_list.extend(['White'] * 9)
race_list.extend(['Asian/NHPI'] * 9)
race_list.extend(['AIAN'] * 9)
race_list.extend(['-Total-'] * 9)
test = pd.DataFrame.from_dict({'group': race_list,
                               'age': ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80+'] * 6,
                               'percent': [
                                           0.0072,0.0136,0.0275,0.0297,0.0284,0.0285,0.0252,0.0252,0.0325,  # Black
                                           0.0086,0.0172,0.0331,0.0328,0.0346,0.0328,0.0261,0.022,0.0247,  # Hispanic/Latino
                                           0.0063,0.0188,0.0302,0.0234,0.0236,0.0227,0.0184,0.0181,0.0276,  # White
                                           0.0168,0.022,0.0282,0.0234,0.0229,0.027,0.0273,0.032,0.0508,  # Asian/NHPI
                                           0.0169,0.0315,0.0507,0.053,0.047,0.0387,0.0307,0.0251,0.0268,  # AIAN
                                           0.0134,0.0319,0.0573,0.0497,0.0494,0.0463,0.0356,0.0316,0.0453,  # All
                               ]})
alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', sort='y', axis=alt.Axis(format='.0%'), title=''),
      y=alt.Y('age', title='Age'),
      column=alt.Column('group',
                        title='Percent of Age and Race/Ethnicity Group in the U.S. who had COVID-19 as of Dec 16',
                        header=alt.Header(titleFontSize=13)),
      color=alt.Color('group', scale=alt.Scale(scheme='category20'), title='Race/Ethnicity',
                      ),
      order=alt.Order('group:N'),
      tooltip=[
                  alt.Tooltip('group:N', title='Race/Ethnicity group'),
                  alt.Tooltip('age:N', title='Age'),
                  alt.Tooltip('percent:Q', format='.1%', title='Cases in age group'),
      ]
).properties(
  width=110, 
).display()

We can see above that people age 20-29 are more likely to get COVID-19 than any other age group in almost all race/ethnicity groups. Because different race/ethnicity groups have different age compositions, splitting the cases into cases per age and race/ethnicity group allows us to compare race/ethnicity data against each other without the different age compositions complicating the comparison.

We can also look at the age-adjusted case rates, which uses a standard age composition across all race/ethnicity groups to weight the values within each age group. This allows us to compare the rate of COVID-19 within each race/ethnicity group and remove age composition differences as a factor from the comparison. The CDC has published age-adjusted prevalence data for deaths, but not for cases.

We can see below that, unlike for COVID-19 deaths, the crude and age-adjusted numbers are fairly similar within each race/ethnicity group except for Asian/NHPI, where the age-adjusted rate is 1.2 percentage points higher. Note that we combined those Asian and NHPI into one category to calculate age-adjusted numbers due to the availability of Census/ACS data with those age and race/ethnicity breakdowns.

In [164]:
#@title
test = pd.DataFrame.from_dict({'group': [
                                         'Black', 'Hispanic/Latino', 'White', 'Asian/NHPI', 'AIAN', '-Total-',
                                         'Black', 'Hispanic/Latino', 'White', 'Asian/NHPI', 'AIAN', '-Total-'
                                         ],
                               'measure': ['Crude', 'Crude', 'Crude', 'Crude', 'Crude', 'Crude',
                                         'Age Adjusted', 'Age Adjusted', 'Age Adjusted', 'Age Adjusted', 'Age Adjusted', 'Age Adjusted'],
                               'percent': [.0234, .0254, .0211, .0129, .0376, .0409,
                                           .0231, .0260, .0206, .0249, .0378, .0403]})
alt.Chart(test).mark_bar().encode(
      x=alt.X('percent', axis=alt.Axis(format='.1%'), title=''),
      y=alt.Y('measure', sort='x', title=''),
      row=alt.Row('group', title='Race/ethnicity'),
      color=alt.Color('measure', title='', scale=alt.Scale(scheme='category20')),
      order=alt.Order('group:N'),
      tooltip=[
                  alt.Tooltip('group:N', title='Field'),
                  alt.Tooltip('measure:N', title='Measure'),
                  alt.Tooltip('percent:Q', format='.2%', title='Cases in race/ethnicity group'),
      ]
).properties(
    title='Percent of Race/Ethnicity Group in the U.S. who had COVID-19 as of Dec 16'
).display()

Counties: Race/ethnicity

We can now examine the race/ethnicity disparities at the county level. We first look at the percentage of people who had COVID-19 within each county. We show all counties with data even if they have a small percentage of cases with known race/ethnicity. Note that the legend only goes to 20%, so counties with a higher rate will be shown in the same dark brown color. You can hover over the counties in the map for more details.

Larger versions of these maps for hovering over smaller counties are available here.

In [51]:
#@title
black = group_charts['cases_per_100']['black'].properties(width=450, height=325)
hispanic = group_charts['cases_per_100']['hispanic'].properties(width=450, height=325)
white = group_charts['cases_per_100']['white'].properties(width=450, height=325)
asian = group_charts['cases_per_100']['asian'].properties(width=450, height=325)
aian = group_charts['cases_per_100']['aian'].properties(width=450, height=325)
nhpi = group_charts['cases_per_100']['nhpi'].properties(width=450, height=325)

((black | hispanic) &
 (white | asian) &
 (aian | nhpi)).configure_legend(
      orient='top',
      gradientLength=400,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
  ).display()

We can also view disparities by comparing the percentage of total cases that a race/ethnicity group accounts for in a county (the cases share) vs. the percentage of the total population that a race/ethnicity accounts for in a county (the population share). There is no disparity when the cases share is equal to the population share for all race/ethnicity groups in a county (ratio = 1.0). When the ratio of cases share to population share is above 1.0, then a group has a disproportionate number of cases relative to its share of the population.

In [52]:
#@title
black = group_charts['cases_to_pop']['black'].properties(width=450, height=325)
hispanic = group_charts['cases_to_pop']['hispanic'].properties(width=450, height=325)
white = group_charts['cases_to_pop']['white'].properties(width=450, height=325)
asian = group_charts['cases_to_pop']['asian'].properties(width=450, height=325)
aian = group_charts['cases_to_pop']['aian'].properties(width=450, height=325)
nhpi = group_charts['cases_to_pop']['nhpi'].properties(width=450, height=325)

((black | hispanic) &
 (white | asian) &
 (aian | nhpi)).configure_legend(
      orient='top',
      gradientLength=400,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
  ).display()

Appendix

Privacy suppression of race/ethnicity

In [165]:
#@title
us_states = alt.topo_feature(data.us_10m.url, 'states')
us_counties = alt.topo_feature(data.us_10m.url+"#", 'counties')

highlight = alt.selection_single(on='mouseover', fields=['id', 'state_fips_code'], empty='none')

tooltips = [alt.Tooltip('state:N', title='State'),
              alt.Tooltip('cdc_cases:Q', format=',', title='CDC cases'),
              alt.Tooltip('cdc_known_or_na_cases:Q', format=',', title='Known or suppressed race/ethnicity cases'),
              alt.Tooltip('cdc_known_or_na_cases_percent:Q', format='.1%', title='Percent known or suppressed cases'),
]

plot = alt.Chart(us_states).mark_geoshape(
      stroke='white',
      strokeOpacity=.2,
      strokeWidth=1
  ).project(
    type='albersUsa'
  ).transform_lookup(
      lookup='id',
      from_=alt.LookupData(crdt_merged_df, 'state_fips_code', ['percent', 'state', 'cdc_cases', 'cdc_known_or_na_cases', 'cdc_known_or_na_cases_percent'])
  ).encode(
      alt.Color('cdc_known_or_na_cases_percent',  
                type='quantitative', 
                legend=alt.Legend(format='.0%'),
                scale=alt.Scale(scheme='redyellowblue',
                                domain=[0, 1],
                                clamp=True,
                                ),
                title=''),
       tooltip=tooltips
  ).add_selection(
      highlight,
  )

states_outline = alt.Chart(us_states).mark_geoshape(stroke='white', strokeWidth=1.5, fillOpacity=0, fill='white').project(
      type='albersUsa'
)

states_fill = alt.Chart(us_states).mark_geoshape(
      fill='#black',
      stroke='white'
).project('albersUsa')

cdc_known_or_na_layered_map = alt.layer(states_fill, plot, states_outline).properties(
      width=450,
      height=350,
      title='Percent of CDC Cases with Known or Suppressed Race/Ethnicity as of Dec 16'
)

What about the 2% of cases with unknown race/ethnicity due to suppression for privacy reasons? Are some states more subject to privacy suppression, and so when assessing completeness, we should take into account the cases that did have race/ethnicity but were suppressed?

If we look at the percentage of cases that either have known or suppressed race/ethnicity, we can see that only a few states improve significantly in their data completeness, such as Wyoming and Louisiana.

In [166]:
#@title
(cdc_known_or_na_layered_map | cdc_known_layered_map).configure_legend(
      orient='top',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

If we want to assess individual counties, we can look at the percent of cases with known or suppressed race/ethnicity at the county level. We can expect more improvements at the county level than the state level due to smaller population sizes counties that leads to more suppression.

In [55]:
#@title
(percent_known_or_na_cases.properties(
    width=450,
    height=350
) | known_percent.properties(
    width=450,
    height=350
)).configure_legend(
      orient='top',
      gradientLength=200,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
).display()

Here is a larger version of the chart on the left to more easily zoom in on different counties.

In [167]:
#@title
percent_known_or_na_cases.configure_legend(
      orient='top-right',
      gradientLength=400,
      titleLimit=0,
  ).configure_view(
      strokeWidth=0,
  ).properties(
    width=900,
    height=650,
  ).display()
In [ ]:
# Large county-level disparity maps
#@title
for group in ['black', 'hispanic', 'white', 'asian', 'aian', 'nhpi']:
  (group_charts['cases_per_100'][group]).properties(
      width=900,
      height=650,
  ).configure_legend(
      orient='top-right',
      gradientLength=400,
      titleLimit=0,
      padding=0
  ).configure_view(
      strokeWidth=0,
  ).display()
In [ ]:
#@title
for group in ['black', 'hispanic', 'white', 'asian', 'aian', 'nhpi']:
  (group_charts['cases_to_pop'][group]).properties(
      width=900,
      height=650,
  ).configure_legend(
      orient='top-right',
      gradientLength=400,
      titleLimit=0,
      padding=0
  ).configure_view(
      strokeWidth=0,
  ).display()